FAIR, ethical, and coordinated data sharing for COVID-19 response: a scoping review and cross-sectional survey of COVID-19 data sharing platforms and registries

Summary
Data sharing is central to the rapid translation of research into advances in clinical medicine and public health practice. In the context of COVID-19, there has been a rush to share data marked by an explosion of population-specific and discipline-specific resources for collecting, curating, and disseminating participant-level data. We conducted a scoping review and cross-sectional survey to identify and describe COVID-19-related platforms and registries that harmonise and share participant-level clinical, omics (eg, genomic and metabolomic data), imaging data, and metadata. We assess how these initiatives map to the best practices for the ethical and equitable management of data and the findable, accessible, interoperable, and reusable (FAIR) principles for data resources. We review gaps and redundancies in COVID-19 data-sharing efforts and provide recommendations to build on existing synergies that align with frameworks for effective and equitable data reuse. We identified 44 COVID-19-related registries and 20 platforms from the scoping review. Data-sharing resources were concentrated in high-income countries and siloed by comorbidity, body system, and data type. Resources for harmonising and sharing clinical data were less likely to implement FAIR principles than those sharing omics or imaging data. Our findings are that more data sharing does not equate to better data sharing, and the semantic and technical interoperability of platforms and registries harmonising and sharing COVID-19-related participant-level data needs to improve to facilitate the global collaboration required to address the COVID-19 crisis.


Platform 1
Combines big data tools and infrastructure. Major investment to continuously store, manage, mine big data sets (e.g., OMICs, imaging data).
Approach to harmonization: Retrospective or prospective
Data types: May be limited to 1 data type or include various prespecified data types

Registry 2
Collection of data stored in an assigned location. Low level of investment needed. Data generally entered or uploaded using the same case report form/data dictionary and the focus is on a particular disease, condition, or exposure.
Approach to harmonization: Prospective
Data types: Generally limited to 1 specific data type

Dataverse 3
Open source web application to share, preserve, cite, and explore research data of various types and with varying objectives.
Approach to harmonization: Data is in its original form and not harmonized
Data types: Any

Datahub 4
Data store that is an integration point for multiple datasets with different structures. Data are moved and stored together, however access permissions vary by data contributor.
Approach to harmonization: Generally involves harmonization of data
Data types: Any

Data lake 4
Central repository or pool of raw and untransformed data of any data type for an undefined purpose and requires other add-on tools to search or operationalize the data. Requires a low level of investment.
Approach to harmonization: Data is in its original form and not harmonized
Data types: Any

Data warehouse 5
Data management tool that contains structured, filtered data that has already been processed and refined for a specific purpose allowing end users to perform further analytics.
Approach to harmonization: No harmonization
Data types: Any

Data federation 6
Technology wherein the data stored in different data sources are made accessible as one integrated virtual database and can be queried, transformed and accessed by data consumers. Data federation is a subset of data virtualization.
Approach to harmonization: Data federation involves transformation, cleansing, and at times, the enrichment of data
Data types: Any

Data virtualization 4
Data virtualization evolved from data federation with additional features and functionalities. According to different software developers, data virtualization has several capabilities beyond data federation including advanced security, query processing, and data transformation features.
Approach to harmonization: Same as data federation
Data types: Any

Data catalogue 7
Website with linkages to available datasets or platforms.

Supplementary Note 1. Natural Language Processing Strategy & Source Code
We applied natural language processing (NLP) to the COVID-19 Open Research Dataset (CORD-19) 8 to identify additional COVID-19-related data sharing platforms and repositories. NLP was a useful approach for handling CORD-19 resources in English and in other languages, because automated processing of such text is difficult without an understanding of the way humans naturally speak and write. We used two main methods to match titles that potentially describe COVID-19-related data sharing platforms. First, we looked at proper nouns, serving the roles of named entities and acronyms, that take the role of the root of the sentence. Second, we matched appositional modifiers to the target search terms to capture items that the initial algorithm missed because of the way the title was written in its respective language.
The initial NLP was conducted in a Jupyter Notebook environment using R. For the initial NLP approach, we singled out titles in the CORD-19 database that were relevant to this publication using the keywords: registry, registries, database, databases, platform, platforms, repository, repositories, IPD-MA, individual participant data meta-analysis, and data dashboard. The original NLP R source code can be found at the following link: https://github.com/matiasbross/NLPCode.
We later updated the NLP approach using the open-source Python NLP library spaCy. 9,10 We singled out titles in the CORD-19 database that were relevant to this publication using the same keywords as in the initial approach.
The updated NLP Python source code can be found at the following link: https://github.com/AdmiralVanko/Cord19-Scraper.
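The keyword screening step described above can be illustrated with a minimal sketch. This is not the authors' code (which is available at the links above) and it does not reproduce the spaCy-based matching of proper nouns or appositional modifiers; it only assumes a plain whole-word, case-insensitive regular-expression match of the listed keywords against a list of titles. The function names `matched_keywords` and `screen_titles` are hypothetical, and the third example title is invented for illustration.

```python
import re

# Keywords listed in Supplementary Note 1.
KEYWORDS = [
    "registry", "registries", "database", "databases",
    "platform", "platforms", "repository", "repositories",
    "IPD-MA", "individual participant data meta-analysis",
    "data dashboard",
]

# One case-insensitive pattern. The lookarounds require that a keyword is not
# embedded in a longer word (e.g. "platform" inside "deplatforming").
KEYWORD_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in KEYWORDS) + r")(?!\w)",
    re.IGNORECASE,
)


def matched_keywords(title):
    """Return the sorted, lower-cased keywords found in a title."""
    return sorted({m.group(1).lower() for m in KEYWORD_PATTERN.finditer(title)})


def screen_titles(titles):
    """Map each title that contains at least one keyword to its keywords."""
    hits = {}
    for title in titles:
        found = matched_keywords(title)
        if found:
            hits[title] = found
    return hits


example_titles = [
    "HIT-COVID, a global database tracking public health interventions to COVID-19",
    "Swab-Seq: A high-throughput platform for massively scaled up SARS-CoV-2 testing",
    "Clinical features of hospitalised patients",  # hypothetical non-matching title
]
for title, keywords in screen_titles(example_titles).items():
    print(keywords, "<-", title)
```

A plain keyword match of this kind is deliberately over-inclusive: as the screening tables in this appendix show, most matched titles were later judged not to collect, harmonize, or share COVID-19 participant-level data, so manual assessment is still required.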

Supplementary Note 2. Quantitative evaluation of the adherence of registries with clinical participant-level COVID-19 data to the FAIR Principles
We limited the quantitative evaluation to registries for participant-level clinical data because we could not apply the same metrics for resources that shared different data types. For example, participant-level clinical registries have restricted access due to the sensitive nature of the data, whereas databases for sharing pathogen OMICs data are open access. While discipline-specific FAIR criteria should be developed using a diverse panel of experts and stakeholders, we applied indicators used by the FAIRshake tool 11 algorithm to better align the tool's evaluation with the specific concerns that we thought would be most important to end users of registries of clinical data. Only one of the registries that collect and harmonize COVID-19 participant-level clinical data had been assigned a DOI prior to our review of the registries. Eighteen of the registries were assigned a DOI by FAIRsharing as part of our evaluation. Seventeen of the registries that we contacted to assign a DOI did not respond to these inquiries, and we could not quantify their FAIRness.
Below, we review the criteria used to create a preliminary rubric for evaluating registries' adherence to the FAIR principles. These draft criteria will be presented to the Research Data Alliance, an international network of individuals and groups working to improve FAIR data. Blue text indicates metrics from our research team's dataset. Green text is used for metrics from the FAIR Data Maturity Model Specification and Guidelines 2020. 12 Text with a strikethrough indicates text that was removed from the corresponding indicator in our dataset.

Preliminary criteria for the application of the FAIR assessment rubric

Findable
• PID (unique & persistent identifier) for the data (RDA-F1-01D / RDA-F1-02D) Does the repository provide PIDs for the datasets therein? Value: Values will be the same for all the COVID-19 resources we assessed because, without data access, it is not clear (in a machine-actionable manner) what kind of PID they use. This means all the registries will fail this indicator.
• Annotation with metadata (RDA-F2-01M) This is the only indicator that could differ between resources: the quality / quantity of metadata annotation. We can consider 3 levels: i) nothing -> fail; ii) minimum (contact, description, to be defined) -> medium; iii) rich (? to be defined). This could also be the sum of the criteria filled out in the WHO survey. Value: Consider the metric as a success for every registry, as this was a criterion to enter them in FAIRsharing (a minimum set of metadata is required for a resource to be inserted into FAIRsharing).
• PID (unique & persistent identifier) for the metadata (RDA-F1-01M / RDA-F1-02M) Is the metadata from the repository assigned a unique & persistent identifier? Value: Values will be the same for all the COVID-19 resources because they have a PID for metadata in FAIRsharing.
• Findable on search engines (RDA-F4-01M) Are the registries findable on search engines? (We can check whether they are marked up with Schema.org.) Did not assess whether the registries were present on portals or institutional websites. Value: every registry was a success except "European Renal Association COVID-19 Database" (when searching on Google, the https://www.eracoda.org/ link could not be found, but it can be reached through the FAIRsharing link or other websites that redirect to the link).

Accessible
• Standard protocol and secured standard protocol (https, ftps) (RDA-A1-04M / RDA-A1.1-01M) Is the metadata accessible via standard protocols such as HTTPS and FTPS? Value: Values are the same for all the registries, as they are accessible through an https website.
• Authentication secure (RDA-A1.2-01D) Is sensitive data accessible by secure authentication? Value: Consider REDCap secure (=success). For registries where we do not know whether there is authentication (REDCap is not used), assigned a "Not Clear" value.
• Metadata accessibility on the long term (RDA-A2-01M) Will the metadata be accessible in the long term even if the resource disappears? Value: Values will be the same for all the COVID-19 resources because all the resources have a PID for metadata on FAIRsharing.
• Contact information (no correspondence with RDA) Is there any contact information available on the website? (Not sure that we should make a distinction between a "registry contact" and a PI contact: a registry contact is better for sustainability, but there is a chance that these rapidly emerging resources will disappear just as quickly and, in this case, a PI contact is better.) "Registry email" "PI email(s)" "Registry contact email(s)" Value: Marked as successful only if at least one of these criteria is met.
• Contact information valid (no correspondence with RDA) In the context of this type of repository, it is important that the contact responds. The WHO team sent a survey to the contact and either received an answer or did not. We think only "responded to survey" can be considered. "Responded to survey with detailed questions about data types, sharing, and governance" "Notes from investigator on how to access data" Value: Marked as successful if "Responded to survey with detailed questions about data types, sharing, and governance" is met.
• Data access (RDA-A1-01M) Is there a clear description of the access to the data? "Link to description of how to access data" "Link to clearly specified governance mechanism for reviewing data access requests" "Link to clearly specified criteria for reviewing data access requests" "Criteria for reviewing data access requests (from REDCap)" "Who controls access to the data" Value: Ignored "Criteria for reviewing data access requests (from REDCap)." Averaged the other 4 criteria (green: 1; red: 0; yellow: 0.5). Removed "Who controls access to the data" because it adds no information.
• Data sharing (no correspondence with RDA) Is the data shared? As raw data is not directly accessible, here we can assess whether summaries, reports, data dashboards, or scientific articles are available. "Data sharing status" "Investigator explanation for why data won't be shared (write N/A if data will be shared)" "Is there a data dashboard / articles / reports available?" Value: Ignored "Investigator explanation for why data won't be shared (write N/A if data will be shared)." Marked as successful only if one of the two criteria is met.

Interoperable
• Use of a controlled vocabulary (RDA-I1-01D) Does the data use a knowledge representation expressed in a standardised format? It can be assumed here that the use of forms to insert patient data allows the use of a controlled vocabulary. "Link to COVID-19 CRF or data dictionary" Value: Success only if one of the two criteria is met.
• Use of a FAIR controlled vocabulary (RDA-I2-01M / RDA-I2-01D) Does the data use a knowledge representation expressed in a FAIR standardised format? We can remove OMICs standards and imaging data standards because they are not appropriate for these registries. "What formal standards does the platform apply for human OMICs data access?" "Connection between CRF and existing standards (e.g., ICD 9-11, CDASH, SNOMED, LOINC)" "Uses ISARIC/WHO CRF (case report form)?" "Clinical-epidemiological standards used by registry" "OMICs data standards used by registry" "Imaging data standards used by registry" Value: no known standard could be identified in the registries (except for the Extracorporeal Life Support Organization Registry, which uses a clinical-epidemiological standard). Removed OMICs and imaging standards as we only look at registries. Removed the connection between CRF and existing standards (removed from the WHO spreadsheet, and a duplicate of the clinical-epidemiological standards used by the registry).
• Data contextualisation (related resources) (RDA-I3-01M) Are there links to platforms in the same field to contextualise the registry? Are there links to clinical trials on ClinicalTrials.gov? "Links to related platforms" Value: success if yes, failure if no.

Reusable
• Licence (RDA-R1.1-01M) Is there a clear and accessible licence for re-use? "Data usage license" Value: success if yes, failure if no. Yes if a "terms of use," "terms of service," or "copyright notice" link was found on the corresponding website.
• Source of data (no correspondence with RDA) Does metadata include provenance information? "Who can enter data? (anyone, registered users of the platform, the platform hosts)" "How is data entered? (can data be uploaded? Is this through a REDCap data entry platform, etc?)" Value: Averaged the 2 criteria.
• Use of community standard (RDA-R1.3-01M, RDA-R1.3-01D / RDA-R1.3-02M / RDA-R1.3-02D) Do data and metadata comply with a community standard? Are data and metadata expressed in compliance with a machine-understandable community standard? These are the results of "Use of a FAIR controlled vocabulary." Almost all registries missed meeting this criterion. Value: Took the same results as the "Use of a FAIR controlled vocabulary" criterion (meaning failure for most of the registries). The scores of the "Extracorporeal Life Support Organization Registry" and "Discovery VIRUS COVID-19" were only improved, as one standard is not sufficient to meet this criterion.

Specify characteristics of the sources of evidence used as eligibility criteria (e.g., years considered, language, and publication status), and provide a rationale.

2,22
SECTION (ITEM), PRISMA-ScR CHECKLIST ITEM, REPORTED ON PAGE #

Information sources* (item 7)
Describe all information sources in the search (e.g., databases with dates of coverage and contact with authors to identify additional sources), as well as the date the most recent search was executed.
Reported on page: 2, 22

Search (item 8)
Present the full electronic search strategy for at least 1 database, including any limits used, such that it could be repeated.
Reported on page: 22

Selection of sources of evidence † (item 9)
State the process for selecting sources of evidence (i.e., screening and eligibility) included in the scoping review.
Reported on page: 2, 22

Data charting process ‡ (item 10)
Describe the methods of charting data from the included sources of evidence (e.g., calibrated forms or forms that have been tested by the team before their use, and whether data charting was done independently or in duplicate) and any processes for obtaining and confirming data from investigators.
Reported on page: 2, 12

Data items (item 11)
List and define all variables for which data were sought and any assumptions and simplifications made.
Reported on page: 2, 12

Critical appraisal of individual sources of evidence § (item 12)
If done, provide a rationale for conducting a critical appraisal of included sources of evidence; describe the methods used and how this information was used in any data synthesis (if appropriate).

Synthesis of results (item 13)
Describe the methods of handling and summarizing the data that were charted.
Reported on page: 2, 12

Selection of sources of evidence (item 14)
Give numbers of sources of evidence screened, assessed for eligibility, and included in the review, with reasons for exclusions at each stage, ideally using a flow diagram.

Characteristics of sources of evidence (item 15)
For each source of evidence, present characteristics for which data were charted and provide the citations.

Critical appraisal within sources of evidence (item 16)
If done, present data on critical appraisal of included sources of evidence (see item 12).

Results of individual sources of evidence (item 17)
For each included source of evidence, present the relevant data that were charted that relate to the review questions and objectives.

† A more inclusive/heterogeneous term used to account for the different types of evidence or data sources (e.g., quantitative and/or qualitative research, expert opinion, and policy documents) that may be eligible in a scoping review as opposed to only studies. This is not to be confused with information sources (see first footnote).
‡ The frameworks by Arksey and O'Malley (6) and Levac and colleagues (7) and the JBI guidance (4, 5) refer to the process of data extraction in a scoping review as data charting.
§ The process of systematically examining research evidence to assess its validity, results, and relevance before using it to inform a decision. This term is used for items 12 and 19 instead of "risk of bias" (which is more applicable to systematic reviews of interventions) to include and acknowledge the various sources of evidence that may be used in a scoping review (e.g., quantitative and/or qualitative research, expert opinion, and policy document).

Transparent governance
The process for sharing data and facilitating access should be clearly explained, outlining how and when the data can and cannot be shared and defining the associated descriptors of the data.
Be transparent in the use of personal data and respect the privacy and confidentiality of individuals, complying with legal requirements and ethical expectations at all times.
Key policies on publications, intellectual property, and industry involvement should be public.
Websites that are accessible to the general public serve to provide feedback on progress and general results.
Develop clearly defined and accessible information on the purposes, processes, procedures and governance frameworks for data sharing.

Compliance with data protection laws
Be transparent in the use of personal data and respect the privacy and confidentiality of individuals, complying with legal requirements and ethical expectations at all times.
Security: Trust and the promotion of data sharing rely on data management and security mechanisms and also on oversight of their functioning. Mechanisms for identifying and tracking data generators and users should be international.
Privacy, Data protection, Confidentiality: Comply with applicable privacy and data protection regulations at every stage of data sharing.

Evaluate
No.; Title of citation for potential tool; Assessment of utility as resource for collecting, harmonizing and sharing COVID-19 participant-level data

Title: , a comprehensive open-access real-time platform of registered clinical studies for COVID-19
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: The Brighton Collaboration standardized template for collection of key information for risk/benefit assessment of a Modified Vaccinia Ankara (MVA) vaccine platform
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: COVID-19 and its sequelae: a platform for optimal patient care, discovery and training
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: FragMAX: the fragment-screening platform at the MAX IV Laboratory
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: opvCRISPR: One-pot visual RT-LAMP-CRISPR platform for SARS-CoV-2 detection
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: PURY: a database of geometric restraints of hetero compounds for refinement in complexes with macromolecular structures
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: HIT-COVID, a global database tracking public health interventions to COVID-19
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: A collection of designed peptides to target SARS-Cov-2 - ACE2 interaction: PepI-Covid19 database
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: SPDB: a specialized database and web-based analysis platform for swine pathogens
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: MarkerDB: an online database of molecular biomarkers
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: ROBOCOV: An affordable open-source robotic platform for COVID-19 testing by RT-qPCR
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: Neurological manifestations associated with COVID-19: a multicentric registry
Assessment: Duplicate (70)

Title: The spectrum of COVID-19-associated dermatologic manifestations: an international registry of 716 patients from 31 countries
Assessment: Registry that collects participant-level cross-sectional clin-epi data about patients with dermatologic conditions and COVID-19, harmonizes the data, and shares this data

Title: AlzGPS: a genome-wide positioning systems platform to catalyze multi-omics for Alzheimer's drug discovery
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: Swab-Seq: A high-throughput platform for massively scaled up SARS-CoV-2 testing
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: Engineering organoids: a promising platform to understand biology and treat diseases
Assessment: Does not collect, harmonize, share COVID-

Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) Checklist
Green: the FAIR criteria are met; Red: the FAIR criteria are not met; Yellow: insufficient information. ASCO=American Society of Clinical Oncology. ASH=American Society for Hematology. CRF=case report form. CVD=cardiovascular disease. ftps=file transfer protocol secure. https=hypertext transfer protocol secure. ISARIC=International Severe Acute Respiratory and Emerging Infection Consortium. MS=multiple sclerosis. PI=principal investigator. PID=persistent identifier. SECURE=Surveillance Epidemiology of Coronavirus Under Research Exclusion. smtp=simple mail transfer protocol. WHO=World Health Organization.
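The colour codes in this legend, combined with the averaging rule stated in Supplementary Note 2 (green: 1; red: 0; yellow: 0.5), can be turned into a numeric indicator score. The following is a minimal sketch of that arithmetic only, not the evaluation tool used in the study; the function name `indicator_score` and the example ratings are hypothetical.

```python
from statistics import mean

# Numeric values stated in Supplementary Note 2 for the colour codes.
COLOUR_SCORES = {"green": 1.0, "yellow": 0.5, "red": 0.0}


def indicator_score(ratings):
    """Average the colour-coded ratings of the criteria behind one indicator."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    return mean(COLOUR_SCORES[rating.lower()] for rating in ratings)


# Hypothetical registry: one criterion met, one not met, one unclear.
example = indicator_score(["green", "red", "yellow"])
print(example)
```

Treating yellow as 0.5 means that a registry with insufficient information scores between a clear failure and a clear success, so missing documentation lowers a score without being counted as non-compliance.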

Table 1 (
Describe sources of funding for the included sources of evidence, as well as sources of funding for the scoping review. Describe the role of the funders of the scoping review.
Reported on page: 23
JBI = Joanna Briggs Institute; PRISMA-ScR = Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews.
* Where sources of evidence (see second footnote) are compiled from, such as bibliographic databases, social media platforms, and Web sites.

Title of citation for potential tool; Assessment of utility as resource for collecting, harmonizing and sharing COVID-19 participant-level data

Assessment: Registry that collects participant-level cross-sectional clin-epi data about patients with psoriasis and COVID-19, harmonizes the data, but does not share the participant-level data

Title: CMAUP: a database of collective molecular activities of useful plants
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title of citation for potential tool; Assessment of utility as resource for collecting, harmonizing and sharing COVID-19 participant-level data

Assessment: Registry that collects participant-level longitudinal clin-epi data about patients with neurological conditions and COVID-19, harmonizes the data, and shares this data. Has several prospective cohorts including adult and pediatric cohorts

Title: Propedia: a database for protein-peptide identification based on a hybrid clustering algorithm
Assessment: Does not collect, harmonize, share COVID-19 participant-level data

Title: icumonitoring.ch: a platform for short-term forecasting of intensive care unit occupancy during the COVID-19 epidemic in Switzerland

GloPID-R Principles of Sharing Data in Public Health Emergencies 13
COVID-19 NCS Data Sharing Principles 14
International Code of Conduct for Data Sharing in Genomic Research 15
GA4GH Framework for Responsible Sharing of Genomic and Health Related Data 16
CARE Principles for Indigenous Data Governance 17
Data quality & security: Store and process the data collected, used and transferred in a way that is accurate, verifiable, unbiased, proportionate, and current, so as to enhance their interoperability and replicability and also preserve their long-term searchability and integrity. Ensure feedback mechanisms on the utility, quality, security, and accuracy of data, and their annotations, with a view to improving quality and interoperability and appropriate reuse by others.
CARE=Collective benefit, Authority to control, Responsibility, Ethics. GA4GH=Global Alliance for Genomics and Health. GloPID-R=Global Research Collaboration for Infectious Disease Preparedness. NCS=National Core Studies.